Getting started with geospatial coding in R using the terra and sf packages

Author

Jason Flower

Published

May 29, 2025

Abstract

This is the first session of the ICCB workshop titled “Open Source Geospatial Tools for Conservation under Climate Change - a Koala Case Study”. We will provide an introduction to working with geospatial data in R using the sf and terra packages.

No prior experience of spatial data is assumed, but this introduction will not have time to delve deeply into some important aspects of spatial data such as projections. We will use the two most commonly used R packages for geospatial data manipulation: sf for manipulating vector data, and terra for manipulating raster and vector data. If you are still using the raster package, you should move to terra; it is simpler, faster and can do more!


Prerequisites

You will need the terra and sf packages installed. We will also make an interactive map with terra, which requires the leaflet package to be installed, and we will need the dplyr package for data manipulation.

install.packages(c("sf", "terra", "leaflet", "dplyr"))

If you have problems, there are more detailed installation instructions on the terra and sf package websites.

We can now load the packages:

library(dplyr)
library(terra)
library(sf)

Spatial data

There are basically two types of spatial data: vector and raster.

Vector data

Vector data can be points, lines or polygons. Vectors are useful for representing things like survey locations, rivers, and boundaries.

Code
pts <- rbind(c(3.2,4), c(3,4.6), c(3.8,4.4), c(3.5,3.8), c(3.4,3.6), c(3.9,4.5)) |>
  vect()

lnes <- as.lines(vect(rbind(c(3,4.6), c(3.2,4), c(3.5,3.8)))) |>
  rbind(as.lines(vect(rbind(c(3.9, 4.5), c(3.8, 4.4), c(3.5,3.8), c(3.4,3.6)))))

lux <- vect(system.file("ex/lux.shp", package = "terra"))

par(mfrow = c(1,3))

plot(pts, axes = FALSE, main = "Points")
plot(lnes, col = "blue", axes = FALSE, main = "Lines")
plot(lux, "NAME_2", col = terrain.colors(12), las = 1, axes = FALSE, main = "Polygons")

Raster data

Raster data is a grid of rectangles, normally called cells. Each cell has a value, making rasters useful for storing continuous data, such as temperature and elevation.

Here is an example of raster data, where each cell in the raster represents elevation.

Code
par(mfrow = c(1,1))

elev <- system.file("ex/elev.tif", package = "terra") |>
  rast() |>
  aggregate(fact = 2)

plot(elev, las = 1, main = "Elevation map", col = terrain.colors(100))

elev |>
  as.polygons(aggregate = FALSE, na.rm = FALSE) |>
  lines(col = "grey40", lwd = 0.2)

Getting started

Making and inspecting a raster

Let's start by creating our own raster. We will be doing all manipulation of rasters with the terra package. To create rasters from scratch, or load them from a file, we use the function rast(). We can create a simple raster by specifying the x and y limits for the raster and the resolution (how big each cell is).

#create raster
ras <- rast(xmin = 0, xmax = 10, ymin = 0, ymax = 10, resolution = 2)

#see what we've created
ras
class       : SpatRaster 
dimensions  : 5, 5, 1  (nrow, ncol, nlyr)
resolution  : 2, 2  (x, y)
extent      : 0, 10, 0, 10  (xmin, xmax, ymin, ymax)
coord. ref. : lon/lat WGS 84 

The figure below shows what most of the terms above refer to. As you can see, you don’t need to use all the terms to define a raster. Some other important points to note:

  • Every object in R has a class, such as data.frame, and as you can see, rasters in terra are of class SpatRaster.
  • We did not tell rast() which coordinate reference system to use, so it defaults to using longitude latitude coordinates, also known as EPSG 4326. We will come back to coordinate reference systems later.
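We can check this default directly: `crs()` returns the full well-known text description of the CRS, and the `describe = TRUE` argument gives a compact summary that includes the EPSG code.

```r
library(terra)

# the same empty raster as above
ras <- rast(xmin = 0, xmax = 10, ymin = 0, ymax = 10, resolution = 2)

# compact summary of the coordinate reference system, including the EPSG code
crs(ras, describe = TRUE)
```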

But what does the raster we created actually look like when plotted? Let's see. All we need is plot():

plot(ras)

Why is there no plot? Because the raster we created is empty; there are no values associated with the cells. Let's assign some values to each cell in the raster and try again. First we will find out how many cells are in our raster using ncell():

ncell(ras)
[1] 25

Ok, now we know this, let's give our raster cells values from 1 to 25:

values(ras) <- 1:25

plot(ras)

Now that our raster has values, we get a plot! Each cell has an integer value between 1 and 25, with cell values increasing from left to right and top to bottom. So the values start being “filled up” in the top left, and finish in the bottom right.
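We can see this ordering directly: `values()` returns the cell values row by row, so the first five values are the top row of the raster.

```r
library(terra)

ras <- rast(xmin = 0, xmax = 10, ymin = 0, ymax = 10, resolution = 2)
values(ras) <- 1:25

# the first 5 values are the top row, left to right
values(ras)[1:5]

# reshape into the raster's grid layout (rows filled left to right)
matrix(values(ras), nrow = nrow(ras), byrow = TRUE)
```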

Let's have another look at our raster properties:

ras
class       : SpatRaster 
dimensions  : 5, 5, 1  (nrow, ncol, nlyr)
resolution  : 2, 2  (x, y)
extent      : 0, 10, 0, 10  (xmin, xmax, ymin, ymax)
coord. ref. : lon/lat WGS 84 
source(s)   : memory
name        : lyr.1 
min value   :     1 
max value   :    25 

We can now see a few extra pieces of information compared to last time:

  • source(s): where is the data held on your computer? It says memory for this raster, indicating that the raster is in the computer's memory. Rasters can also be held on your hard disk, in which case this will be the file name of the raster. We won't go into details here, but terra is smart about loading data into memory, only doing so when it needs to and when it thinks there will be enough space.
  • name: what is the raster called?
  • min value & max value: the minimum and maximum values in the raster
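Each piece of this printed summary can also be retrieved with its own accessor function, which is handy when you need the value in code rather than on screen:

```r
library(terra)

ras <- rast(xmin = 0, xmax = 10, ymin = 0, ymax = 10, resolution = 2)
values(ras) <- 1:25

ncell(ras)   # number of cells
res(ras)     # resolution (x, y)
names(ras)   # layer name(s)
minmax(ras)  # minimum and maximum cell values
```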

Ok, now that we understand the basic structure of a raster, let's look at vector data using the sf package.

Making and inspecting a vector

As mentioned earlier, vector data can be points, lines or polygons. Let's start by making a single spatial point and plotting it:

pt <- st_point(c(1,3))

plot(pt, axes = TRUE)

Simple! But this is just a point. What if we want our point to have some information attached to it? For example, what the temperature is at that point. Well first we need to convert it into a simple feature collection:

pt_sf <- pt |> 
  st_sfc() |> 
  st_as_sf()

pt_sf

Inspecting the simple feature collection pt_sf that we created, we see that there is only 1 feature: our point. There is also a bounding box, which is the same concept as the x and y limits we had for our raster. Like the raster, it also has a coordinate reference system (CRS), which is currently not defined.

Now that our point is a simple feature collection, we can add some information to it. We will add a column called “temperature” and give our point a value of 25.

pt_sf$temperature <- 25

pt_sf

Now our point has a “field” attached to it with temperature data. The fields are simply columns that have information for each geometry; in our case a point.
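In real work you will often start from a table of coordinates with attached measurements. `st_as_sf()` can convert such a data frame straight into a simple feature collection; the coordinates and temperatures below are made up for illustration.

```r
library(sf)

# a hypothetical table of survey points with a temperature measurement
surveys <- data.frame(
  x = c(1, 2.5, 4),
  y = c(3, 1.5, 2),
  temperature = c(25, 23, 27)
)

# convert to a simple feature collection of points
surveys_sf <- st_as_sf(surveys, coords = c("x", "y"))

surveys_sf
```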

Real world data

Vectors

Loading and plotting

Most of the time, we won’t be making our own data from scratch, but reading in data that we have downloaded or been provided with. In this workshop, you are going to be doing a case study of koalas in south-east Queensland (SEQ). One of the first pieces of spatial data we normally need is the boundary of the area we are working in. We will use spatial data of the local government areas (LGAs) in Queensland, downloaded from the Queensland government website, to define our SEQ boundary.

We use st_read() to read the data, which is in the data folder you should have downloaded with the code from Github. The file is a GeoPackage, which is widely used for storing geospatial data, and is similar to (but better than!) the shapefile format. The st_read() command can read data in many different spatial formats: run st_drivers() to see a complete list.

lgas <- st_read("data/Local_Goverment_Areas.gpkg")
Reading layer `Local_Government_Areas' from data source 
  `/home/jason/R_projects/ICCB-spatial-intro/data/Local_Goverment_Areas.gpkg' 
  using driver `GPKG'
Simple feature collection with 78 features and 6 fields
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: 137.9946 ymin: -29.17927 xmax: 153.5519 ymax: -9.087991
Geodetic CRS:  GDA94

sf gives us some information about the data we have read: we can see it is a MULTIPOLYGON and has 78 features and 6 fields, i.e. 78 rows of geometry and 6 columns of data. Let's have a look at the first few rows (features):

head(lgas)

Now we can see that the 6 fields contain information about each LGA, including various name formats and the areas. The final column called “Shape” contains the spatial information: the coordinates for each point that makes up the polygons. This column is most commonly labelled “geometry”.

The fields in our sf data are just like columns in a data frame. So if we want all the values in one column, we can just select that column in the same way as a data frame:

lgas$lga
 [1] "Aurukun Shire"                    "Balonne Shire"                   
 [3] "Banana Shire"                     "Barcaldine Regional"             
 [5] "Barcoo Shire"                     "Blackall Tambo Regional"         
 [7] "Boulia Shire"                     "Brisbane City"                   
 [9] "Burdekin Shire"                   "Burke Shire"                     
[11] "Cairns Regional"                  "Carpentaria Shire"               
[13] "Cassowary Coast Regional"         "Central Highlands Regional"      
[15] "Charters Towers Regional"         "Maranoa Regional"                
[17] "Mareeba Shire"                    "Mckinlay Shire"                  
[19] "Moreton Bay City"                 "Mornington Shire"                
[21] "Mount Isa City"                   "Murweh Shire"                    
[23] "Napranum Aboriginal Shire"        "Noosa Shire"                     
[25] "North Burnett Regional"           "Northern Peninsula Area Regional"
[27] "Palm Island Aboriginal Shire"     "Paroo Shire"                     
[29] "Pormpuraaw Aboriginal Shire"      "Quilpie Shire"                   
[31] "Redland City"                     "Richmond Shire"                  
[33] "Rockhampton Regional"             "Scenic Rim Regional"             
[35] "Somerset Regional"                "South Burnett Regional"          
[37] "Southern Downs Regional"          "Sunshine Coast Regional"         
[39] "Tablelands Regional"              "Toowoomba Regional"              
[41] "Torres Shire"                     "Torres Strait Island Regional"   
[43] "Townsville City"                  "Bulloo Shire"                    
[45] "Bundaberg Regional"               "Cherbourg Aboriginal Shire"      
[47] "Cloncurry Shire"                  "Cook Shire"                      
[49] "Croydon Shire"                    "Diamantina Shire"                
[51] "Isaac Regional"                   "Kowanyama Aboriginal Shire"      
[53] "Livingstone Shire"                "Lockhart River Aboriginal Shire" 
[55] "Lockyer Valley Regional"          "Logan City"                      
[57] "Longreach Regional"               "Mackay Regional"                 
[59] "Mapoon Aboriginal Shire"          "Doomadgee Aboriginal Shire"      
[61] "Douglas Shire"                    "Etheridge Shire"                 
[63] "Flinders Shire"                   "Fraser Coast Regional"           
[65] "Gladstone Regional"               "Gold Coast City"                 
[67] "Goondiwindi Regional"             "Gympie Regional"                 
[69] "Hinchinbrook Shire"               "Hope Vale Aboriginal Shire"      
[71] "Ipswich City"                     "Weipa Town"                      
[73] "Western Downs Regional"           "Whitsunday Regional"             
[75] "Winton Shire"                     "Woorabinda Aboriginal Shire"     
[77] "Wujal Wujal Aboriginal Shire"     "Yarrabah Aboriginal Shire"       

Let’s plot the data to see what we have:

plot(lgas)

Looks like Queensland! We got one map for each field (column). If we want a map of just one field we can select only the field we want:

plot(lgas[, "lga"], axes = TRUE)

We added axes using the axes = TRUE argument.

Coordinate reference systems and projection

Remember that the CRS of our data is GDA94. What does this mean? We can get some more information using st_crs():

st_crs(lgas)
Coordinate Reference System:
  User input: GDA94 
  wkt:
GEOGCRS["GDA94",
    DATUM["Geocentric Datum of Australia 1994",
        ELLIPSOID["GRS 1980",6378137,298.257222101,
            LENGTHUNIT["metre",1]]],
    PRIMEM["Greenwich",0,
        ANGLEUNIT["degree",0.0174532925199433]],
    CS[ellipsoidal,2],
        AXIS["geodetic latitude (Lat)",north,
            ORDER[1],
            ANGLEUNIT["degree",0.0174532925199433]],
        AXIS["geodetic longitude (Lon)",east,
            ORDER[2],
            ANGLEUNIT["degree",0.0174532925199433]],
    USAGE[
        SCOPE["Horizontal component of 3D system."],
        AREA["Australia including Lord Howe Island, Macquarie Island, Ashmore and Cartier Islands, Christmas Island, Cocos (Keeling) Islands, Norfolk Island. All onshore and offshore."],
        BBOX[-60.55,93.41,-8.47,173.34]],
    ID["EPSG",4283]]

The “wkt”, well-known text, gives a lot of detail, but we can see that this is a CRS that is specific to Australia.

We often have to deal with spatial data that have different CRSs, which involves transforming the data from one CRS to another. But first, what is a CRS?

Coordinate reference systems


If we want to know where things are in space, we need to use some kind of spatial reference. We can use our own arbitrary system, e.g. a sampling grid like the one shown in the photo below where we could define the location of each square relative to one of the corners. But normally we want to know where something is on Earth.

There are two types of coordinate systems we can use to represent locations on Earth:

  • Geographic coordinate systems (GCS): use a 3-D surface (e.g. a globe) to define locations on the Earth using longitude and latitude. The 3-D surface is normally an ellipsoid, which approximates the Earth but cannot be exact since the Earth is not a smooth surface. This approximation of the Earth is called a datum, and it can be aligned with the true Earth (the geoid) in different ways, depending on whether we are trying to get a good approximation at some particular location (a local datum), e.g. Brisbane, or the best approximation across the whole Earth (a geocentric datum). The figure below shows examples of these datums.

A commonly used geocentric datum is the World Geodetic System 1984 (WGS84). This is almost synonymous with the commonly used coordinate reference system EPSG 4326, but strictly speaking EPSG 4326 defines the latitude and longitude coordinates used on the WGS84 ellipsoid.

  • Projected coordinate systems (projections): Unfortunately, we can’t carry around globes all the time when we want to look at a map, and doing calculations, such as distance and area, in 3-D is much more difficult than in 2-D. So we need a way of getting from our 3-D globe to a piece of paper (or, for younger people, a screen). To do this we ‘project’ from a GCS to a projected coordinate system. It is called a projection because we can think of it as putting a light in the centre of a globe so that it shines through the globe, projecting features onto a flat piece of paper. The figure below (from the QGIS docs) illustrates this, showing the 3 projection families:

a) cylindrical, b) conical, and c) planar projections

All projections are a compromise because they will always distort the shape, area, distances, or directions of the original GCS. A map that preserves shape is called conformal; one that preserves area is called equal-area; one that preserves distance is called equidistant; and one that preserves direction is called azimuthal. There are a huge number of projections available, and choosing the correct one can be challenging. Often your choice will be to use a local projection used by government or other authorities in the location you are working in, e.g. for this workshop we will use EPSG 3112, GDA94 / Geoscience Australia Lambert, which is appropriate for Australia. For global work where equal area is important, the Mollweide projection is commonly used.
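You can inspect any CRS from its EPSG code with `st_crs()`, which is a quick way to check a projection's name and units before using it:

```r
library(sf)

# the projected CRS we will use in this workshop
crs_3112 <- st_crs(3112)

crs_3112$Name       # human-readable name of the CRS
crs_3112$units_gdal # units of the CRS
```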

We have only covered the basics of coordinate reference systems here, because it is a big topic; there are many good resources available if you want to understand it in more depth.

Coming back to our Queensland data, you will remember (or can check with st_crs(lgas)) that it is in the GDA94 CRS, which is a geographic CRS. We want to transform it into a suitable projected CRS. We will be using GDA94 / Geoscience Australia Lambert, which is identified by the EPSG code 3112. Transforming from one CRS to another is done using st_transform():

lgas_projected <- st_transform(lgas, 3112)

plot(lgas_projected[, "lga"], axes = TRUE)

The data looks the same, but our axes are no longer in degrees latitude and longitude. What are the units of measurement in our projected CRS? Normally they would be metres (unless you're in some weird American projection and you're using feet), but let's check:

st_crs(lgas_projected)$units_gdal
[1] "metre"

Yep, we’re in metres. Compare this to the units for the unprojected data:

st_crs(lgas)$units_gdal
[1] "degree"
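To make the transformation concrete, here is a single point near Brisbane (coordinates approximate) converted from lon/lat to the projected CRS; the same location goes from degrees to metres:

```r
library(sf)

# a point near Brisbane, in longitude/latitude (EPSG 4326)
pt <- st_sfc(st_point(c(153.0, -27.5)), crs = 4326)

# transform to GDA94 / Geoscience Australia Lambert (EPSG 3112)
pt_projected <- st_transform(pt, 3112)

st_coordinates(pt)           # coordinates in degrees
st_coordinates(pt_projected) # coordinates in metres
```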

Subsetting and merging

Our data has local government areas (LGAs) for the whole of Queensland, but we just want a polygon of south-east Queensland (SEQ). How do we get that?

First, we make a vector containing the names of all the LGAs in SEQ:

seq_lga_names <- c(
  "Brisbane City",
  "Moreton Bay City",
  "Logan City",
  "Ipswich City",
  "Redland City",
  "Scenic Rim Regional",
  "Somerset Regional",
  "Lockyer Valley Regional",
  "Gold Coast City",
  "Sunshine Coast Regional",
  "Toowoomba Regional",
  "Noosa Shire"
)

Now we can subset our LGAs spatial data for just these LGAs:

lgas_seq <- lgas_projected |> 
  filter(lga %in% seq_lga_names)

plot(lgas_seq[, "lga"])

Great! We’ve got only the LGAs in SEQ. But at the moment we have many polygons that make up SEQ:

head(lgas_seq)

How do we get just one polygon that is the boundary of the area? We use st_union():

seq_boundary <- st_union(lgas_seq)

plot(seq_boundary)

This merges all our polygons into one polygon. Note that we lose all the fields when we do this:

head(seq_boundary)
Geometry set for 1 feature 
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: 1619136 ymin: -3349899 xmax: 1901228 ymax: -3107565
Projected CRS: GDA94 / Geoscience Australia Lambert
MULTIPOLYGON (((1854712 -3339933, 1854648 -3339...
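The same union-then-measure pattern works on any polygon layer. As a self-contained sketch using the North Carolina counties dataset that ships with sf (a stand-in for the LGA data), we can merge the counties and then measure the merged polygon with st_area():

```r
library(sf)

# sf's bundled North Carolina counties, standing in for the LGA data
nc <- st_read(system.file("shape/nc.shp", package = "sf"), quiet = TRUE)

# merge all counties into a single polygon
nc_boundary <- st_union(nc)

# area of the merged polygon, converted from m^2 to km^2
as.numeric(st_area(nc_boundary)) / 1e6
```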

Saving

We will need the SEQ polygon later, so let's save it using st_write():

st_write(seq_boundary, "data/seq_polygon.gpkg", append = FALSE)
Deleting layer `seq_polygon' using driver `GPKG'
Writing layer `seq_polygon' to data source `data/seq_polygon.gpkg' using driver `GPKG'
Writing 1 features with 0 fields and geometry type Multi Polygon.

You can save in many different formats (see st_drivers()). You can change the file format just by changing the file extension, e.g. use “seq_polygon.shp” if you want a shapefile.

We use the append = FALSE argument to overwrite a file with the same name. This is useful if you are making changes to code and want to make sure that the most up-to-date version of your data gets saved.
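A quick way to confirm a write worked is to read the file back and compare; here is a sketch using sf's bundled North Carolina data and a temporary file path:

```r
library(sf)

nc <- st_read(system.file("shape/nc.shp", package = "sf"), quiet = TRUE)

# write to a GeoPackage and read it back
out_file <- file.path(tempdir(), "nc.gpkg")
st_write(nc, out_file, append = FALSE, quiet = TRUE)

nc_check <- st_read(out_file, quiet = TRUE)
nrow(nc_check) # same number of features as nc
```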

Rasters

Loading and plotting

We now want to get some climate data for our south-east Queensland study area, which we will use in a later session to do species modelling. Climate data is normally stored in raster format, so we will be using the terra package to manipulate it. The climate data is already in the data folder. Let's load one of the climate rasters and have a look at it:

climate1 <- rast("data/Current_climate_QLD/bioclim_01.tif")

climate1
class       : SpatRaster 
dimensions  : 681, 841, 1  (nrow, ncol, nlyr)
resolution  : 0.05, 0.05  (x, y)
extent      : 111.975, 154.025, -44.025, -9.975  (xmin, xmax, ymin, ymax)
coord. ref. : lon/lat WGS 84 (EPSG:4326) 
source      : bioclim_01.tif 
name        : bioclim_01 

It is a raster with 0.05 degree resolution: we know it is in degrees because it is unprojected, in the EPSG 4326 coordinate reference system. Let’s plot the data:

plot(climate1)

We can tell from the legend that this is probably temperature data in degrees Celsius, and that it covers the whole of Australia. How can we inspect the values in more detail?

Raster values

We can get a histogram of all the values in a raster just using the hist() function:

hist(climate1)

There is a tail of low values, which are mostly from the south-east of Australia (Tasmania is chilly!). What is the mean value for the whole of Australia?

global(climate1, "mean", na.rm = TRUE)

We can also get a statistical summary of the raster values just by using the summary() function:

summary(climate1)
Warning: [summary] used a sample
   bioclim_01    
 Min.   : 4.854  
 1st Qu.:19.546  
 Median :22.583  
 Mean   :22.202  
 3rd Qu.:25.625  
 Max.   :29.441  
 NA's   :50934   

Raster math

The great thing about rasters is that you can do maths with them! For example, doing climate1 + 1 just adds one to each raster value, and doing climate1*2 multiplies each raster value by two.

As an example, let's convert our temperature data into Fahrenheit for our confused colleagues from the U.S. The conversion from Celsius to Fahrenheit is: Fahrenheit = (Celsius * 1.8) + 32.

#do the conversion
climate1_fahrenheit <- (climate1*1.8) + 32

#plot our new raster
plot(climate1_fahrenheit)
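We can sanity-check the conversion: because the transformation is linear, the mean of the converted raster should equal the converted mean. A sketch with a small synthetic raster:

```r
library(terra)

# a small raster with made-up "temperatures" in Celsius
r <- rast(xmin = 0, xmax = 10, ymin = 0, ymax = 10, resolution = 2)
values(r) <- 1:25

r_fahrenheit <- (r * 1.8) + 32

mean_c <- global(r, "mean")$mean
mean_f <- global(r_fahrenheit, "mean")$mean

# the mean transforms the same way as the cell values
isTRUE(all.equal(mean_f, (mean_c * 1.8) + 32))
```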

Projecting

We only want data for south-east Queensland (SEQ). Our polygon of SEQ is in a local projection, EPSG 3112. To get our raster into the same projection, we need to project it. In terra we use the function project() which is equivalent to the st_transform() function from the sf package:

climate1_projected <- project(climate1, "EPSG:3112")

plot(climate1_projected)

It is really important to understand that when we project a raster, we are changing the underlying data, because we have to resample the values onto a new grid in the new CRS.

Let's compare our original raster and the projected version:

climate1
class       : SpatRaster 
dimensions  : 681, 841, 1  (nrow, ncol, nlyr)
resolution  : 0.05, 0.05  (x, y)
extent      : 111.975, 154.025, -44.025, -9.975  (xmin, xmax, ymin, ymax)
coord. ref. : lon/lat WGS 84 (EPSG:4326) 
source      : bioclim_01.tif 
name        : bioclim_01 
climate1_projected
class       : SpatRaster 
dimensions  : 770, 924, 1  (nrow, ncol, nlyr)
resolution  : 5123.954, 5123.954  (x, y)
extent      : -2477612, 2256922, -5117360, -1171915  (xmin, xmax, ymin, ymax)
coord. ref. : GDA94 / Geoscience Australia Lambert (EPSG:3112) 
source(s)   : memory
name        : bioclim_01 
min value   :   5.030247 
max value   :  29.429588 

The projected raster is in the GDA94 / Geoscience Australia Lambert projection, and the extent, resolution, and dimensions have all changed. This CRS has units of metres, which means the size of each raster cell is 5123.95 m x 5123.95 m.
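Because the units are metres, the resolution gives the cell size directly, and terra's cellSize() can compute the area of every cell. A sketch using terra's bundled elevation raster, projected here to a UTM zone (EPSG 32632, which covers Luxembourg):

```r
library(terra)

elev <- rast(system.file("ex/elev.tif", package = "terra"))

# project to a metre-based CRS
elev_utm <- project(elev, "EPSG:32632")

res(elev_utm)                   # cell width and height in metres
cellSize(elev_utm, unit = "km") # raster of cell areas in km^2
```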

Let’s compare the statistics for the two rasters:

summary(climate1)
Warning: [summary] used a sample
   bioclim_01    
 Min.   : 4.854  
 1st Qu.:19.546  
 Median :22.583  
 Mean   :22.202  
 3rd Qu.:25.625  
 Max.   :29.441  
 NA's   :50934   
summary(climate1_projected)
Warning: [summary] used a sample
   bioclim_01    
 Min.   : 5.914  
 1st Qu.:19.828  
 Median :22.827  
 Mean   :22.410  
 3rd Qu.:25.877  
 Max.   :29.413  
 NA's   :58957   

The statistics are similar, but not exactly the same. There are also a lot more NA values in the projected raster, largely because there are more cells: 572721 in the original and 711480 in the projected raster.

These changes in the raster are really important to think about when making decisions about projecting rasters. In general, avoid projecting rasters if you can. You can find more information about the methods available when projecting in the project() help file (?project).
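You can count the NA cells yourself: is.na() on a raster returns a raster of 0s and 1s, so a global sum gives the count. A sketch with terra's bundled elevation raster:

```r
library(terra)

elev <- rast(system.file("ex/elev.tif", package = "terra"))
elev_projected <- project(elev, "EPSG:32632")

# summing the 0/1 raster from is.na() counts the NA cells
global(is.na(elev), "sum")
global(is.na(elev_projected), "sum")

# the projected raster also has a different number of cells
ncell(elev)
ncell(elev_projected)
```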

Cropping and masking

The raster data we have at the moment is for a much larger area than just SEQ. Let’s plot our raster and SEQ polygon together to see:

seq_vect <- vect(seq_boundary) #we need to convert our sf polygon into a SpatVector, which is the vector format that the terra package uses

plot(climate1_projected) 
lines(seq_vect) #adds the SEQ polygon as lines on top of the raster

To get only SEQ data, we need to crop and mask the raster data using our SEQ polygon.

Cropping means that we keep only the data inside the extent of the vector we are using. Masking means that all the data outside the vector is set to NA (or some other value we specify). Let's have a look at how this works.

First, let's have a look at the extent of the SEQ polygon. We can get the extent of a raster or vector using ext(). We need to convert this into a SpatVector object for plotting, using vect(). We only need to do this for plotting; when we crop, we can just use the SEQ polygon as the input.

seq_vect_extent <- ext(seq_vect) |> 
  vect()

plot(climate1_projected)
lines(seq_vect)
lines(seq_vect_extent, col = "blue")

Cropping removes everything outside the extent (the blue box) of our polygon, so when we crop, we keep only the area within the blue box. Masking sets all values outside the polygon itself to NA.

We crop using the crop() function, with the raster we want to crop as the first argument and the vector we are cropping to as the second.

#crop
climate1_cropped <- crop(climate1_projected, seq_vect)

#plot
plot(climate1_cropped)
lines(seq_vect)

Now that we have cropped our raster, we can mask it so that we only have values for the area within the SEQ boundary. We do this using mask():

#mask
climate1_cropped_masked <- mask(climate1_cropped, seq_vect)

#plot
plot(climate1_cropped_masked)
lines(seq_vect)

Now we only see raster values for cells that are within the SEQ boundary. But remember that the areas that are white still have values; they are just NA values. We can confirm this by plotting the NA cells in grey (or any other colour):

plot(climate1_cropped_masked, colNA = "grey")
lines(seq_vect)

Often we want to crop and mask one after the other, and you can do this in one command using crop(climate1_projected, seq_vect, mask = TRUE).
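The one-step and two-step approaches give the same result; we can verify this with terra's bundled Luxembourg data (one commune used as the cropping polygon):

```r
library(terra)

elev <- rast(system.file("ex/elev.tif", package = "terra"))
lux <- vect(system.file("ex/lux.shp", package = "terra"))

# use a single commune as the polygon to crop and mask with
one_commune <- lux[1, ]

two_steps <- mask(crop(elev, one_commune), one_commune)
one_step  <- crop(elev, one_commune, mask = TRUE)

# both approaches produce identical cell values
all.equal(values(two_steps), values(one_step))
```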

For reference, here is a figure comparing what crop, mask and crop(mask = TRUE) do:

Code
par(mfrow = c(2,2))

plot(climate1_projected, main = "Original raster")
lines(seq_vect)

plot(climate1_cropped, main = "Cropped")
lines(seq_vect)

climate1_projected |>
  mask(seq_vect) |>
  plot(main = "Masked")
lines(seq_vect)

plot(crop(climate1_projected, seq_vect, mask = TRUE), main = "Cropped and masked")
lines(seq_vect)

Code
par(mfrow = c(1,1))

Why not just mask rather than crop and mask? As we can see in the figure above, this would leave a lot of area we are not interested in, and even though most of those cells would be NA, they would still take up space in our raster, so it is not efficient.
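We can quantify the difference: masking alone keeps the full grid (every cell, mostly NA), while cropping shrinks the grid to the polygon's extent. A sketch with terra's bundled Luxembourg data:

```r
library(terra)

elev <- rast(system.file("ex/elev.tif", package = "terra"))
lux <- vect(system.file("ex/lux.shp", package = "terra"))

one_commune <- lux[1, ]

ncell(mask(elev, one_commune))              # same as ncell(elev): grid unchanged
ncell(crop(elev, one_commune, mask = TRUE)) # far fewer cells
```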